{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Neural Network from scratch in PyTorch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Load and inspect data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 530,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   x_0  x_1  x_2  y\n",
      "0  1.0    0    0  0\n",
      "1  0.0    0    5  0\n",
      "2  1.0    1    3  1\n",
      "3  0.0    1    1  0\n",
      "4  0.0    1    1  1\n",
      "5  0.0    1    2  0\n",
      "6  1.0    0    1  1\n",
      "7  1.1    0    1  0\n",
      "8  1.0    0    0  1\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data_path = 'data.csv'\n",
    "data = pd.read_csv(data_path)       # dataframe\n",
    "\n",
    "print(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Features, data points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 531,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(9, 4)\n"
     ]
    }
   ],
   "source": [
    "# FEATURES: measurable properties/attributes we can use to predict\n",
    "# check number of data points (rows) and number of features (columns except target 'y' in this case)\n",
    "\n",
    "print(data.shape)       # tuple (rows, columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 9 rows\n",
    "- 4 columns \n",
    "    - 3 features (since 'y' is a column too in this case)\n",
    "    - 3 x 9 : 27 data points"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Range of features\n",
    "\n",
    "By knowing the range of each feature, we can apply proper normalization (transform feature values to a standard scale) to ensure all features contribute proportionaly during training.\n",
    "- For ex., if the range of one feature is 1000x larger that another, then during loss minimization, the gradients associated with the larger-scaled feature will likely be larger. This disproportion can cause the optimization process to overemphasize that feature, even though that feature might not actually be too influential in the prediction itself, potentially skewing weight updates and adversely affecting the overall training process"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 532,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Range of features: \n",
      "('x_0', 1.1)\n",
      "('x_1', 1.0)\n",
      "('x_2', 5.0)\n"
     ]
    }
   ],
   "source": [
    "# determine range of each feature:      max - min\n",
    "\n",
    "features_columns = [col for col in data if col != 'y']\n",
    "features_ranges = {}\n",
    "\n",
    "for feature in features_columns:\n",
    "    min_val = data[feature].min()\n",
    "    max_val = data[feature].max()\n",
    "    features_ranges[feature] = float(max_val - min_val)\n",
    "\n",
    "print(\"Range of features: \")\n",
    "for features_range in features_ranges.items():\n",
    "    print(features_range)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model and package selection\n",
    "\n",
    "- Because the target column consists of 0s and 1s, this is a binary classification problem (predicting y from x features)\n",
    "    - Use a multi-layer perceptron (MLP)\n",
    "\n",
    "- Use pytorch to define, train and evaluate the model\n",
    "- Use scikit-learn to split the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 533,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "# seperate features and target\n",
    "features_values = data[features_columns].values\n",
    "target_values = data['y'].values\n",
    "\n",
    "# convert to tensors    (tensors: multidimensional homogenous data structures, good for parallelism and have many operation optimizations in packages like pytorch)\n",
    "features_values = torch.Tensor(features_values)\n",
    "target_values = torch.Tensor(target_values) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `**` Normalization / scale data `**`\n",
    "\n",
    "StandardScaler standardizes features by rescale them to have a mean of 0 and a standard deviation of 1.\n",
    "- Standardization does NOT change the shape of the data: it does NOT transform the data into a Gaussian distribution, it only standardizes the scale. The underlying distribution of the data remains unchanged\n",
    "    - i.e. DISTRIBUTION of the original data remains the same, but the numerical values are scaled such that 0 is the center/average and each data point is spread out by 1 unit\n",
    "\n",
    "<br><br>\n",
    "$x' = \\frac{x - \\mu}{\\sigma}$\n",
    "\n",
    "**Where:**\n",
    "\n",
    "- $x$ = original data point  \n",
    "- $\\mu$ = mean of the feature (before standardization)  \n",
    "- $\\sigma$ = standard deviation of the feature (before standardization)  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 534,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\nfrom sklearn.preprocessing import StandardScaler\\n\\nscaler = StandardScaler()\\nfeatures_values = scaler.fit_transform(features_values)\\n'"
      ]
     },
     "execution_count": 534,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"\"\"\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_values = scaler.fit_transform(features_values)\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Split the data (80% train, 20% test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 535,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "x_train, x_test, y_train, y_test = train_test_split(features_values, target_values, test_size = 0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create pytorch Dataset and Dataloader\n",
    "\n",
    "- Dataset: stores the samples and their labels\n",
    "- DataLoader: wraps an iterable around the `Dataset` to enable easy access to the samples. Makes it parallelized to load in batches to the model\n",
    "\n",
    "<br>\n",
    "\n",
    "- Batch: a subset of the training data processed together in one forward/backward pass\n",
    "    - batch size value depends on memory constraints, model size, dataset size, etc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 536,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[1., 1., 3.],\n",
      "        [1., 0., 0.]])\n",
      "torch.Size([2, 3])\n",
      "tensor([1., 1.])\n",
      "torch.Size([2])\n"
     ]
    }
   ],
   "source": [
    "from torch.utils.data import TensorDataset, DataLoader\n",
    "\n",
    "# create datasets\n",
    "train_dataset = TensorDataset(x_train, y_train)         \n",
    "test_dataset = TensorDataset(x_test, y_test)\n",
    "\n",
    "# create dataloaders\n",
    "train_loader = DataLoader(train_dataset, batch_size=32)\n",
    "test_loader = DataLoader(test_dataset, batch_size=32)\n",
    "\n",
    "\"\"\"\n",
    "**ADD SHUFFLE**\n",
    "train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)       # shuffle tells dataloader pull images in random order (not order of dataset)\n",
    "test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)\n",
    "\"\"\"\n",
    "\n",
    "for x, y in test_loader:\n",
    "    break\n",
    "\n",
    "\n",
    "print(x)    \n",
    "\n",
    "#           tensor([[1.0000, 0.0000, 0.0000],\n",
    "#                 [1.0000, 1.0000, 3.0000],\n",
    "#                 [1.0000, 0.0000, 1.0000],\n",
    "#                 [1.1000, 0.0000, 1.0000],\n",
    "#                 [0.0000, 0.0000, 5.0000],\n",
    "#                 [1.0000, 0.0000, 0.0000],\n",
    "#                 [0.0000, 1.0000, 1.0000]])\n",
    "\n",
    "print(x.shape)  # torch.Size([7, 3])        each batch has 7 samples, each with 3 features\n",
    "\n",
    "print(y)        # tensor([1., 1., 1., 0., 0., 0., 1.])\n",
    "\n",
    "print(y.shape)  # torch.Size([7])           7 different labels corresponding to the 7 samples in this batch\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Neural Network\n",
    "\n",
    "<i>Note: Layer sizes decrease gradually to funnel information</i>\n",
    "\n",
    "**Architecture**\n",
    "\n",
    "- Input layer: (# features) neurons\n",
    "<br><br>\n",
    "- Hidden layer 1: 64 neurons (2-3x input features)\n",
    "<br><br>\n",
    "- Hidden layer 2: 32 neurons (half previous layer for gradual dimension reduction)\n",
    "<br><br>\n",
    "- Output layer: 1 neuron (for binary classification)\n",
    "<br><br>\n",
    "- Activation function: **ReLU**. max(0, x) - returns x if positive, 0 if negative\n",
    "    - prevents vanishing gradient problem (when gradients used to update the network become very slow. so network learns too slow or not at all)\n",
    "<br><br>\n",
    "- Sigmoid: squash output between 0 and 1 (for binary classification problem)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 537,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "NeuralNetwork(\n",
       "  (model): Sequential(\n",
       "    (0): Linear(in_features=3, out_features=64, bias=True)\n",
       "    (1): ReLU()\n",
       "    (2): Linear(in_features=64, out_features=32, bias=True)\n",
       "    (3): ReLU()\n",
       "    (4): Linear(in_features=32, out_features=1, bias=True)\n",
       "    (5): Sigmoid()\n",
       "  )\n",
       ")"
      ]
     },
     "execution_count": 537,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch.nn as nn\n",
    "\n",
    "\"\"\"\n",
    "In this cell, see improvements at comments with `** IMPROVEMENTS:`\n",
    "\n",
    "- Dropout: neurons are zeroed out with a probability p (a hyperparameter), and those neurons produce zeroes for that forward pass. \n",
    "    since zeroing neurons reduces the overall weight of the activation values, the remaining neurons are scaled up by 1/p\n",
    "            \n",
    "\"\"\"\n",
    "\n",
    "\n",
    "class NeuralNetwork(nn.Module):         # nn.Module is the base class for all neural networks. Our model will be a subclass that inherits this superclass\n",
    "    def __init__(self, input_size):     # input_size: number of the features, `len(features_columns)`\n",
    "        super().__init__()\n",
    "\n",
    "        self.model = nn.Sequential(\n",
    "            nn.Linear(input_size, 64),  \n",
    "            # **  IMPROVEMENT: nn.BatchNorm1d(64),          # normalizes layer inputs during training  (can help with unstable loss values)\n",
    "            nn.ReLU(),                                  \n",
    "            # **  IMPROVEMENT: nn.Dropout(0.2),             # dropout: randomly sets some neurons to 0 (deactivates them) during training (helps when little data, easy to overfit)\n",
    "            # **  IMPROVEMENT: nn.ReLU(),                          \n",
    "            nn.Linear(64, 32),\n",
    "            nn.ReLU(),\n",
    "            # **  IMPROVEMENT: nn.Dropout(0.2),\n",
    "            nn.Linear(32, 1),\n",
    "            nn.Sigmoid()            \n",
    "        )\n",
    "\n",
    "    def forward(self, x):\n",
    "        return self.model(x)\n",
    "    \n",
    "\n",
    "# initialize\n",
    "model = NeuralNetwork(len(features_columns))\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loss function\n",
    "\n",
    "Measure how inaccurate model predictions are and give gradient direction for optimization. The model **learns by MINIMIZING the loss function** (i.e. minimizing its errors).\n",
    "\n",
    "<br><br>\n",
    "$MSE = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2$\n",
    "\n",
    "**Where:**\n",
    "\n",
    "- $n$ = number of samples  \n",
    "- $y_i$ = actual (true) value  \n",
    "- $\\hat{y}_i$ = predicted value  \n",
    "  \n",
    "*scaled by* $\\frac{1}{n}$ *so the derivative is cleaner for backpropagation*\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 538,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\" \\n# For binary classification, pytorch's BCE:\\nloss = nn.BCELoss\\n\""
      ]
     },
     "execution_count": 538,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "loss = nn.MSELoss()\n",
    "\n",
    "\"\"\" \n",
    "# For binary classification, pytorch's BCE:\n",
    "loss = nn.BCELoss\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gradient Descent\n",
    "\n",
    "Gradient descent is an optimization algorithm used to iteratively adjust parameters in order to minimize the loss function. \n",
    "- Computes the gradient (partial derivatives) of the loss function w.r.t the parameters\n",
    "- Updates parameters in the direction of steepest descent (negative gradient)\n",
    "\n",
    "<br><br>\n",
    "Learning rate (lr): scaling factor that controls how much the model updates the weights at each step"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 539,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "' Increase learning rate for faster convergence (convergence: reach optimal/stable model performance)\\noptimizer = optim.Adam(model.parameters(), lr=0.01)\\n'"
      ]
     },
     "execution_count": 539,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch.optim as optim\n",
    "\n",
    "optimizer = optim.Adam(model.parameters(), lr=0.0001)\n",
    "\n",
    "\"\"\" Increase learning rate for faster convergence (convergence: reach optimal/stable model performance)\n",
    "optimizer = optim.Adam(model.parameters(), lr=0.01)\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train model\n",
    "\n",
    "- 1 **epoch**: 1 complete pass through the entire training dataset\n",
    "<br><br>\n",
    "\n",
    "The model's predicted outputs appear like ex.: tensor([0.5198]).\n",
    "<br><br>\n",
    "This is the raw probability produced by the sigmoid activation. In binary classification, it's common for the model to output a continuous value between 0 or 1 (through sigmoid). These probabilities are then typically thresholded (often at 0.5) to decide the final binary class (0 or 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 540,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 1, Accuracy: 0.0%\n",
      "Epoch 2, Accuracy: 0.0%\n",
      "Epoch 3, Accuracy: 0.0%\n",
      "Epoch 4, Accuracy: 0.0%\n",
      "Epoch 5, Accuracy: 0.0%\n",
      "Epoch 6, Accuracy: 0.0%\n",
      "Epoch 7, Accuracy: 0.0%\n",
      "Epoch 8, Accuracy: 0.0%\n",
      "Epoch 9, Accuracy: 0.0%\n",
      "Epoch 10, Accuracy: 0.0%\n",
      "Epoch 11, Accuracy: 0.0%\n",
      "Epoch 12, Accuracy: 0.0%\n",
      "Epoch 13, Accuracy: 0.0%\n",
      "Epoch 14, Accuracy: 0.0%\n",
      "Epoch 15, Accuracy: 0.0%\n",
      "Epoch 16, Accuracy: 0.0%\n",
      "Epoch 17, Accuracy: 0.0%\n",
      "Epoch 18, Accuracy: 0.0%\n",
      "Epoch 19, Accuracy: 0.0%\n",
      "Epoch 20, Accuracy: 0.0%\n",
      "Final accuracy: 0.0%\n"
     ]
    }
   ],
   "source": [
    "for epoch in range(20):\n",
    "    model.train()\n",
    "\n",
    "    for x_batch, y_batch in train_loader:       # for each batch (data/x and label/y) in train dataset,\n",
    "        optimizer.zero_grad()                       # resets gradients to zero before each batch\n",
    "        y_pred = model(x_batch).squeeze()          # get model prediction. Without .squeeze(), loss fails because of dimension mismatch\n",
    "        loss_val = loss(y_pred, y_batch)            # calculate loss (mse)\n",
    "        #print(f\"Loss: {loss_val.item():.4f}\")\n",
    "        loss_val.backward()                         # backpropagation\n",
    "        optimizer.step()                            # updates model weights using gradients & applies learning rate\n",
    "\n",
    "    model.eval() # put model into eval mode\n",
    "\n",
    "    with torch.no_grad():                           # disables gradient calculations (saves memory during eval, faster inference)\n",
    "        correct = 0                 \n",
    "        total = 0\n",
    "        for x_batch, y_batch in test_loader:\n",
    "            y_pred = model(x_batch).squeeze()       # ex. tensor([[0.3047],[0.1656]]) -> tensor([0.3274, 0.1942])\n",
    "            predicted = (y_pred >= 0.5).float()     # (y_pred >= 0.5) converts probabilities to boolan tensor([False, False]), .float() converts it to tensor([0., 0.])\n",
    "            total += y_batch.size(0)                         # add batch size\n",
    "            correct += (predicted == y_batch).sum().item()   # count matches\n",
    "        \n",
    "        accuracy = correct/total * 100              # calculate percentage\n",
    "        print(f\"Epoch {epoch + 1}, Accuracy: {accuracy:}%\")\n",
    "\n",
    "print(f'Final accuracy: {accuracy:}%')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### \"If this was time series how would you take that into account and train it too\"\n",
    "\n",
    "- If data were a time series, we would need to account for the **sequential order**\n",
    "    - capture **temporal dependencies** (relationships between past and future events)\n",
    "\n",
    "<br>\n",
    "\n",
    "- set `shuffle=False` for train_loader: shuffling breaks temporal dependencies by randomizing order. sequential batches allow model to learn trends/seasonality\n",
    "- set smaller batch size for train_loader: smaller batches preserve more local patterns"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ML-from-scratch",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
